fix(ssh): add WebSocket ping/pong to per-session revdial connections #5947
Conversation
I wonder if the gorilla/websocket already responds to pings with pongs during `NextReader()`:

```go
// pkg/revdial/revdial.go — grabConn(), line 442
c := wsconnadapter.New(wsConn)
c.Ping()
select {
case ln.connc <- c:
case <-ln.donec:
}
```
Force-pushed from 7031a95 to fe11347
@gustavosbarreto Good call — I've moved that. This fix works for the immediate issue, but we'll need to think more deeply about how we want to implement this long-term. Given that we've never had agent integrations in other languages, we need to find a way to enable this back-and-forth communication that accounts for all clients. Some of our customers rely on a large number of devices, and in those cases this extra WebSocket ping/pong traffic on every active session could impact performance.
Sharing some research on cloud load-balancer behavior that reinforces why this keepalive fix is necessary. All major cloud vendors enforce idle timeouts on load-balanced TCP/WebSocket connections, and some require application-level data (not just TCP keepalives) to reset the timer. AWS ALB and GCP TCP Proxy both require actual data in the payload to keep the connection alive; TCP-level keepalive packets aren't enough. WebSocket ping/pong frames count as application-level data, which is why our fix in #5947 works.
@gustavosbarreto — requesting your review on this one. The changes go beyond the original per-session keepalive fix and touch shared infrastructure, so I want to make sure nothing has unintended side effects.

What to pay attention to:

1. The old `Close()` ran the teardown on every call; now `sync.Once` runs it exactly once and subsequent calls return the stored error from the first call. This matters because several code paths can trigger `Close()` concurrently (the pong-timeout `AfterFunc`, normal teardown). If any code path currently depends on getting an error back from a second `Close()` call, its behavior changes.
2. The old nil-check guard (`if a.pongCh != nil`) in `Ping()` is replaced with `sync.Once`, so concurrent callers can no longer spawn duplicate keepalive goroutines.
3. The gateway Nginx change disables Nginx's read timeout for these WebSocket connections. The concern here is zombie connections: if the application-level keepalive ever stops working (bug, deadlock), Nginx won't clean up the connection; it will hang indefinitely. Previously, the 60s default acted as a safety net (albeit a fragile one). Worth considering whether we want a large finite value instead of disabling the timeout entirely.
In V1 (revdial) transport, the main control connection has bidirectional keepalive (ping/pong + JSON keep-alive), but per-session WebSocket connections created via dial-back have none. During idle SSH sessions, the only traffic on the per-session WebSocket is the agent's SSH keepalive (agent→server, one-way, every 30s). The server→agent direction is completely silent, causing intermediaries (load balancers, NAT, firewalls) to detect a half-idle connection and close it.

Call `Ping()` on the agent-side `wsconnadapter` in `grabConn()` to start sending WebSocket ping frames every 30 seconds. gorilla/websocket automatically responds to pings with pong frames during `NextReader()`, creating bidirectional traffic that keeps the connection alive through all intermediaries. Placing this on the agent side distributes the goroutine cost across agents rather than concentrating it on the server.

Fixes: #5946
Force-pushed from 00ee90c to aa1f041
Fix race condition in wsconnadapter Close() where concurrent callers (e.g. pong timeout AfterFunc and normal teardown) could panic on send-to-closed-channel or double-close the WebSocket connection. Use sync.Once to guarantee both channel and connection cleanup happen exactly once.
The previous nil-check guard on pongCh was racy: two concurrent callers could both see nil and create duplicate channels and goroutines, leaking the first set. Use sync.Once to guarantee initialization happens exactly once, consistent with the Close() fix in the previous commit.
Force-pushed from aa1f041 to fc41136
@gustavosbarreto — the branch has been updated and all CI checks are now passing (unit tests + integration tests on both mongo and postgres). The main change since your earlier review: the `sync.Once` rework of `Close()` and `Ping()` in the WebSocket adapter.

The remaining changes are:

- per-session WebSocket ping/pong in `pkg/revdial`
- `Close()` now storing and returning the error from its first invocation
- a timeout increase for the flaky `terminal_window_size_change` test

Could you take another look when you get a chance?
The sync.Once-based Close() previously returned nil on repeated calls. Store the error from the first close so callers always receive it.
Force-pushed from fc41136 to 18cad39
The terminal_window_size_change test had a 2s per-attempt timeout and 5s overall deadline for reading stty output over a PTY. Under load, terminal I/O can exceed these tight limits, causing ~40% flakiness locally. Increase to 10s per-attempt and 30s overall. Verified with 20 consecutive runs (0 failures).
Summary
Fix idle SSH session disconnections caused by intermediaries (load balancers, NAT, reverse proxies) closing silent WebSocket connections. This PR addresses #5946 with per-session WebSocket keepalive and race condition fixes in the WebSocket adapter.
Changes

**Per-session WebSocket ping/pong (`pkg/revdial`)**

The V1 (revdial) per-session WebSocket connections had no server→agent keepalive. During idle SSH sessions, the only traffic was the agent's SSH keepalive (one-way, every `KEEPALIVE_INTERVAL` seconds). The server→agent direction was silent, causing intermediaries to detect a half-idle connection and close it.

Now `Ping()` is called on the agent-side adapter in `grabConn()`, sending WebSocket ping frames every 30s. The server automatically responds with pongs (gorilla/websocket handles this), creating bidirectional traffic that keeps the connection alive through all intermediaries.

**WebSocket adapter `Close()` race fix (`pkg/wsconnadapter/wsconnadapter.go`)**

The old `Close()` had a race condition: concurrent callers (e.g., a pong-timeout `AfterFunc` firing while normal teardown runs) could panic by sending to an already-closed channel, or call `conn.Close()` multiple times. Now `sync.Once` ensures the entire cleanup — stopping the ping goroutine and closing the WebSocket connection — happens exactly once.

Subsequent `Close()` calls return the stored error from the first call rather than nil, preserving error semantics for callers that check the return value.

**WebSocket adapter `Ping()` init race fix (`pkg/wsconnadapter/wsconnadapter.go`)**

The old `Ping()` guarded against re-initialization with a nil check (`if a.pongCh != nil`), which is racy under concurrent calls: both callers see nil, both create channels and goroutines, leaking the first set. Now `sync.Once` guarantees initialization happens exactly once.

**Community testing**
Note: @ltan10 also reported a separate issue with the main control connection dying at 20-27 min intervals. This is unrelated to per-session keepalive — the main connection already has four independent keepalive sources at 30s. Their issue points to external proxy behavior (max connection duration, connection pool limits).
Test plan

- `go build` — all services compile

Fixes #5946